This is my first attempt at a blog post based on a Tidy Tuesday dataset. This week’s Tidy Tuesday dataset is the complete set of dialogue from The Office (US), a show that my 15-year-old son and I have binged several times over. We are not quite at the point of having memorized all the dialogue, but we can see that point from here.
As I was about to post this article, I realized that the Tidy Tuesday repository contained not only a link to the schrute package, but also a link to the article in The Pudding by Caitlyn Ralph. Caitlyn used a different dataset, officequotes.net, than either the schrute package (see below for a discussion of its problems) or the file I found created by Abhinav Ralhan. I think Caitlyn’s careful analysis was made easier by locating cleaner data, and I wish I had used the same starting point.
In December, the schrute package was released by Brad Lindblad, and I happily began a holiday project of combing through the data contained within. However, I immediately found some problems with the data. For example, let’s look at one of the most (in)famous episodes of The Office, “Dinner Party” (season 4, episode 13):
library(schrute)
library(dplyr)  # %>%, filter(), select()
theoffice %>%
  filter(episode_name == "Dinner Party") %>%
  select(index, episode, character, text) %>%
  head(10)
| index | episode | character | text |
|---|---|---|---|
| 16791 | 13 | Stanley | This is ridiculous. |
| 16792 | 13 | Michael | MAN! I WOULD LOVE TO BURN YOUR CANDLES! |
| 16793 | 13 | Phyllis | Do you have any idea what time we’ll get out of here? |
| 16794 | 13 | Jan | YOU BURN IT. YOU BUY IT! |
| 16795 | 13 | Michael | Nobody likes to work late, least of all me. Do you have plans tonight? |
| 16796 | 13 | Michael | OH GOOD. I’LL BE YOUR FIRST CUSTOMER! |
| 16797 | 13 | Jim | Nope I don’t, remember when you told us not to make plans ’cause we’re working. |
| 16798 | 13 | Jan | AND YOU’RE HARDLY MY FIRST! |
| 16799 | 13 | Michael | Yes I remember. Mmm, this is B.S. This is B.S. Why are we here? I am going to call corporate. Enough is enough, I’m - God, I’m so mad! This is Michael Scott, Scranton, well we don’t want to work. No we don’t! It’s not fair to these people. These people are my friends and I care about them! We’re not going to do it! Everybody I just got off the horn with corporate and basically I told them where they could stick their little overtime assignment. Go enjoy your Friday. |
| 16800 | 13 | Michael | THAT’S WHAT SHE SAID! THAT IS A 200 DOLLAR PLASMA SCREEN TV YOU JUST KILLED! Good luck paying me back on your zero dollars a year salary plus benefits, babe! |
As anyone who has seen that supremely cringey episode knows, the dialogue in all caps is a climactic confrontation between Michael and Jan. For some reason, the index of dialogue lines in the schrute dataset, which should be chronological, has interleaved the fight with dialogue earlier in the episode. This is a common but inconsistent problem with the dialogue in the file. The index variable is fatally flawed, with no evident remedy to order the lines correctly.
Thanks to Brad Lindblad for the work he put into the schrute package. However, I was not going to be able to do the analyses I wanted to do with it.
When I first tried out the schrute package in December, I realized it could not provide the information I was hoping for. After searching online for an alternative, I came upon this blog entry by Abhinav Ralhan. The file posted on that page (which I had to download and then import) does have lines in the correct order. It also combines the double episodes, so “Dinner Party” is now season 4, episode 9. We can also select by scene:
library(readr)  # read_csv()
office_raw <- read_csv("the-office-lines - scripts.csv") %>%
  select(id:scene, deleted, speaker, line_text)
office_raw %>%
  filter(season == 4, episode == 9, scene == 19) %>%
  head(12)
| id | season | episode | scene | deleted | speaker | line_text |
|---|---|---|---|---|---|---|
| 20687 | 4 | 9 | 19 | FALSE | Jan | [Michael dips his steak into his wine] Can you not do that? It’s disgusting. |
| 20688 | 4 | 9 | 19 | FALSE | Michael | You know I have soft teeth, how can you say that? |
| 20689 | 4 | 9 | 19 | FALSE | Jan | Oops. |
| 20690 | 4 | 9 | 19 | FALSE | Michael | Excuse me for a second. [gets up from the table] |
| 20691 | 4 | 9 | 19 | FALSE | Jim | [to babysitter] So… how do you guys know each other? |
| 20692 | 4 | 9 | 19 | FALSE | Woman | I was his babysitter. |
| 20693 | 4 | 9 | 19 | FALSE | Pam | And now you guys are dating? |
| 20694 | 4 | 9 | 19 | FALSE | Dwight | Purely carnal and that’s all you need to know. |
| 20695 | 4 | 9 | 19 | FALSE | Jim | Would you write down your e-mail because I have just so many questions… |
| 20696 | 4 | 9 | 19 | FALSE | Woman | E-mail? |
| 20697 | 4 | 9 | 19 | FALSE | Jim | Nevermind. |
| 20698 | 4 | 9 | 19 | FALSE | Michael | Ok… alright… here we go. [takes down huge painting behind his seat and puts up a neon beer sign] There. [plugs it in] Oooookay. |
However, it has other problems. For instance, the dialogue for the later seasons contains a large number of corrupted characters.
office_raw %>%
filter(season == 9, episode == 12) %>%
select(line_text) %>%
head(10)
| line_text |
|---|
| Gotta clear out these file cabinets people, a lot of these are dead accounts. ���Scranton Mimeograph Corp?��� I don���t think we���re doing business with them any time soon. That���s odd. ��A letter from Robert Dunder. ���A valuable artifact has come into my possession. I have hidden it until such time as a person of strong intellect may safely recover it. This golden chalice is of immeasurable historical and religious significance.��� The Holy Grail. |
| [on phone]: Did you send Dwight on a quest for the Holy Grail? |
| I think I���m a little too busy these days to s— [whispering] Oh ,my God. I did send Dwight on a quest for the Holy Grail. |
| The Dunder Code! I completely forgot about that prank. That had to be like six or seven years ago. Stayed late every night for a month. Had a lot more free time back then. |
| I don���t get it. |
| Aha! A lightbulb. |
| A lightbul– |
| A lightbulb. Okay. Okay. [holding note over lamp] Invisible ink. |
| Whoa. |
| ���Higher than numbers go.��� The ceiling above accounting! |
I could not find an easy way to convert the bad characters to the punctuation marks they replaced. The best I could do is change them to a benign character that is unlikely to be part of dialogue, and I chose a tilde (~). The office_raw dataframe will be put through a couple of necessary cleanup steps as part of the import process.
# problem character � replaced with tilde to be more easily repaired
bad_char <- "�"
library(stringr)  # str_replace_all(), str_trim()
office_raw <- read_csv("the-office-lines - scripts.csv") %>%
  select(id:scene, deleted, speaker, line_text) %>%
  mutate(line_text = str_replace_all(line_text, bad_char, "~")) %>%
  mutate(speaker = str_trim(speaker))
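As a minimal illustration of that replacement step, the sketch below uses base R on a made-up two-element vector. The sample strings and the names `lines`, `cleaned`, and `n_bad` are assumptions for illustration, not rows from the file. Writing the replacement character as the escape `"\uFFFD"` keeps the script itself free of the corrupted glyph.

```r
# Hypothetical sample lines; "\uFFFD" is the Unicode replacement character
lines <- c("I don\uFFFD\uFFFD\uFFFDt get it.", "A lightbulb.")

# Swap every replacement character for a tilde (fixed = TRUE: literal match)
cleaned <- gsub("\uFFFD", "~", lines, fixed = TRUE)

# Count how many lines still carry a placeholder and need repair later
n_bad <- sum(grepl("~", cleaned, fixed = TRUE))

print(cleaned)
print(n_bad)
```

Counting the lines that still contain a tilde gives a simple running measure of how much repair work remains.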
Like the schrute package, the new dataset has some misspelled character names.
office_raw %>%
select(speaker) %>%
str_extract_all("Mic[:alpha:]+") %>%
table()
## .
## Micael Micahel Michae Michael Michal Micheal Michel
## 1 4 2 12202 1 9 5
office_raw %>%
select(speaker) %>%
str_extract_all("Dar[:alpha:]+") %>%
table()
## .
## Darrly Darry Darryl Daryl
## 2 1 1258 37
Maybe a handful of misspelled character names is not a big deal to you. I am of a different temperament when it comes to data. The polite term is “tidy”.
I found two ways to clean up the character names. The first, naturally, was brute force. I simply made a summary table of all character names in the dataset, then looked at those that appeared infrequently, going on the assumption that many errors would occur only once or twice.
office_raw %>%
count(speaker) %>%
filter(n == 1) %>%
head(10)
| speaker | n |
|---|---|
| (Pam’s mom) Heleen | 1 |
| [Clark and Pete are shown on screen] Video Andy: Hey, I’m Pete, puberty is such a drag, man. And I’m Clark! I like to eat toilet paper. [Clark and Pete wave at camera] We fail! [Video shows memorial of Jerry | 1 |
| [repeats] Andy: Fail | 1 |
| 3rd Athlead Employee | 1 |
| 4th Athlead Employee | 1 |
| abe | 1 |
| Actress | 1 |
| All but Oscar | 1 |
| All Girls | 1 |
| All the Men | 1 |
Some are clearly errors (Heleen, Anglea, Carrol, Chares). Some are not. Several occur when a character is first introduced in the show and his/her full name is used, like Carol Stills. To see how daunting the cleanup process can be, let’s just see how many variations there are on minor character David Wallace:
office_raw %>%
select(speaker) %>%
str_extract_all("[:alpha:]+id W[:alpha:]+") %>%
table()
## .
## Dacvid Walalce Dacvid Wallace David Wallace David Wallcve
## 1 1 110 1
So I spent a long afternoon creating an endless series of search and replace functions.
office_tidy_chars <- office_raw %>%
mutate(speaker = str_replace(speaker, "Mic[:alpha:]+", "Michael")) %>%
mutate(speaker = str_replace(speaker, "[:alpha:]+hael", "Michael")) %>%
mutate(speaker = str_replace(speaker, "Dight", "Dwight")) %>%
mutate(speaker = str_replace(speaker,
"^Dwig[:alpha:]*[:punct:]*$", "Dwight")) %>%
mutate(speaker = str_replace(speaker, "Meridith", "Meredith")) %>%
mutate(speaker = str_replace(speaker, "Stanely", "Stanley")) %>%
mutate(speaker = str_replace(speaker, "sAndy", "Andy")) %>%
mutate(speaker = str_replace(speaker, "^Ang[:alpha:]+", "Angela")) %>%
mutate(speaker = str_replace(speaker, "^Darr[:alpha:]*", "Darryl")) %>%
mutate(speaker = str_replace(speaker, "Daryl", "Darryl")) %>%
mutate(speaker = str_replace(speaker, "Phyl[:alpha:]*", "Phyllis")) %>%
mutate(speaker = str_replace(speaker, "^abe$", "Gabe")) %>%
mutate(speaker = str_replace(speaker, "Holy", "Holly")) %>%
mutate(speaker = str_replace(speaker, "Chares", "Charles")) %>%
mutate(speaker = str_replace(speaker,
"[:alpha:]+id Wa[:alpha:]+", "David Wallace")) %>%
mutate(speaker = str_replace(speaker,
"Denagelo|DeAngelo|DeAgnelo", "Deangelo")) %>%
mutate(speaker = str_replace(speaker, "M Michael", "Michael")) %>%
mutate(speaker = str_replace(speaker, "^D$", "Dwight")) %>%
mutate(speaker = str_replace(speaker, "^Carro[:alpha:]+", "Carol")) %>%
mutate(speaker = str_replace(speaker, "Heleen", "Helene")) %>%
mutate(speaker = str_replace(speaker, "Mayers", "Meyers")) %>%
mutate(speaker = str_replace(speaker, "Liptop", "Lipton")) %>%
# most of the changes below were for consistency, not to correct errors
mutate(speaker = str_replace(speaker, "Andy/", "Andy and ")) %>%
mutate(speaker = str_replace(speaker, "Pam/", "Pam and ")) %>%
mutate(speaker = str_replace(speaker, "Michael/", "Michael and ")) %>%
mutate(speaker = str_replace(speaker, "Deangelo/", "Deangelo and ")) %>%
mutate(speaker = str_replace(speaker, "Angela/", "Angela and ")) %>%
mutate(speaker = str_replace(speaker,
"Gabe/Kelly/Toby", "Gabe, Kelly, and Toby")) %>%
mutate(speaker = str_replace(speaker, "David Wallace", "David")) %>%
mutate(speaker = str_replace(speaker, "Todd Packer", "Todd")) %>%
mutate(speaker = str_replace(speaker, "Packer", "Todd")) %>%
mutate(speaker = str_replace(speaker, "Robert California", "Robert")) %>%
mutate(speaker = str_replace(speaker, "^Bob$", "Bob Vance")) %>%
mutate(speaker = str_replace(speaker, ", Vance Refrigeration", "")) %>%
mutate(speaker = str_replace(speaker, "Irving", "Erving")) %>%
mutate(speaker = str_replace(speaker, "^Julius$", "Julius Erving")) %>%
mutate(speaker = str_replace(speaker, "MeeMaw", "Mee-Maw")) %>%
mutate(speaker = str_replace(speaker, "&", "and")) %>%
mutate(speaker = str_replace(speaker, "worker", "Worker")) %>%
mutate(speaker = str_replace(speaker, "#", "")) %>%
mutate(speaker = str_replace(speaker, "CameraMan", "Cameraman")) %>%
mutate(speaker = str_replace(speaker, " Guy", " guy")) %>%
mutate(speaker = str_replace(speaker, " Employ", " employ")) %>%
mutate(speaker = str_replace(speaker, " Member", " member")) %>%
mutate(speaker = str_replace(speaker, " Phone", " phone")) %>%
mutate(speaker = str_replace(speaker, " Club", " club")) %>%
mutate(speaker = str_replace(speaker, " Manager", " manager")) %>%
mutate(speaker = str_replace(speaker, " Drive", " drive")) %>%
mutate(speaker = str_replace(speaker, " Crew", " crew")) %>%
mutate(speaker = str_replace(speaker, " Worker", " worker")) %>%
mutate(speaker = str_replace(speaker, " Teacher", " teacher")) %>%
mutate(speaker = str_replace(speaker, " Shareholder", " shareholder")) %>%
mutate(speaker = str_replace(speaker, " Pregnant", " pregnant")) %>%
mutate(speaker = str_replace(speaker, " Assistant", " assistant")) %>%
mutate(speaker = str_replace(speaker, " Guest", " guest")) %>%
mutate(speaker = str_replace(speaker, " Voice", " voice")) %>%
mutate(speaker = str_replace(speaker, " Mom", " mom")) %>%
mutate(speaker = str_replace(speaker, " Dad", " dad")) %>%
mutate(speaker = str_replace(speaker, " Father", " father")) %>%
mutate(speaker = str_replace(speaker, " Brother", " brother")) %>%
mutate(speaker = str_replace(speaker, " Sister", " sister")) %>%
mutate(speaker = str_replace(speaker, " Son", " son")) %>%
mutate(speaker = str_replace(speaker, " Girl", " girl")) %>%
mutate(speaker = str_replace(speaker, " Woman", " woman")) %>%
mutate(speaker = str_replace(speaker, " Man", " man")) %>%
mutate(speaker = str_replace(speaker, " Salesman", " salesman")) %>%
mutate(speaker = str_replace(speaker, "^Everybody$", "Everyone"))
office_raw %>%
distinct(speaker) %>%
nrow()
## [1] 793
office_tidy_chars %>%
distinct(speaker) %>%
nrow()
## [1] 730
The 793 characters have been condensed to 730. I could likely do better, but it may require case-by-case inspection.
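The long chain of replacements above could also be written as a lookup table that is applied in a loop. The sketch below is one possible arrangement in base R, not the code used for this post: the `fixes` vector holds only a small, hypothetical subset of the patterns, `apply_fixes()` is an invented helper, and base R needs the bracketed form `[[:alpha:]]` where stringr accepts `[:alpha:]`.

```r
# Hypothetical lookup table: regex pattern -> canonical name (small subset)
fixes <- c(
  "^Mic[[:alpha:]]+$"   = "Michael",
  "^Dar[[:alpha:]]+$"   = "Darryl",
  "^Ang[[:alpha:]]+$"   = "Angela",
  "^Dwight[[:punct:]]$" = "Dwight"
)

# Apply each pattern in turn; sub() replaces the first match in each element
apply_fixes <- function(x, fixes) {
  for (pattern in names(fixes)) {
    x <- sub(pattern, fixes[[pattern]], x)
  }
  x
}

# Made-up speaker values covering a few of the misspellings seen above
speaker <- c("Micheal", "Daryl", "Anglea", "Dwight:", "Jim", "Michael")
tidy <- apply_fixes(speaker, fixes)
print(tidy)
```

Keeping the patterns in one named vector makes it easier to review, reorder, or extend the corrections than editing dozens of separate `mutate()` calls. stringr's `str_replace_all()` also accepts a named vector of this shape.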
A few days after writing out that interminable list of commands, I asked in the R4DS Slack channel if anyone had a better way, and Scott Came told me about the fuzzyjoin package created by David Robinson. It offers the possibility to join based on near-matches, an indispensable tool for hacking through an arduous task more quickly.
The first thing I did was to identify the most common characters in a dataframe called base_chars, going on the assumption that all character names appearing over 100 times contained no misspellings. Then I rejoined it to the original data using the fuzzyjoin function stringdist_left_join and filtered for matches that did not fit perfectly.
# identify the 31 characters with 100+ lines of dialogue
base_chars <- office_raw %>%
count(speaker) %>%
filter(n >= 100) %>%
select(base_char = speaker)
# use fuzzy left_join, then filter for closest but not identical
library(fuzzyjoin)  # stringdist_left_join()
corr_char <- office_raw %>%
  count(speaker) %>%
  stringdist_left_join(base_chars, by = c("speaker" = "base_char"),
                       method = "cosine",
                       max_dist = 0.2, distance_col = "Distance") %>%
  arrange(Distance, desc(n)) %>%
  # n < 100 drops the base names themselves, which all have 100+ lines
  filter(n < 100 & Distance <= 0.101)
corr_char %>% head(40)
| speaker | n | base_char | Distance |
|---|---|---|---|
| Micheal | 9 | Michael | 0.0000000 |
| Micahel | 4 | Michael | 0.0000000 |
| Darrly | 2 | Darryl | 0.0000000 |
| Stanely | 2 | Stanley | 0.0000000 |
| Anglea | 1 | Angela | 0.0000000 |
| Denagelo | 1 | Deangelo | 0.0000000 |
| Dacvid Walalce | 1 | David Wallace | 0.0200421 |
| Dacvid Wallace | 1 | David Wallace | 0.0200421 |
| Miichael | 1 | Michael | 0.0438171 |
| Phylis | 2 | Phyllis | 0.0474207 |
| David Wallcve | 1 | David Wallace | 0.0488103 |
| Daryl | 37 | Darryl | 0.0513167 |
| Holy | 2 | Holly | 0.0550888 |
| Darry | 1 | Darryl | 0.0645857 |
| M ichael | 1 | Michael | 0.0645857 |
| Michel | 5 | Michael | 0.0741799 |
| Michae | 2 | Michael | 0.0741799 |
| Chares | 1 | Charles | 0.0741799 |
| Dwight: | 1 | Dwight | 0.0741799 |
| Dwight. | 1 | Dwight | 0.0741799 |
| Micael | 1 | Michael | 0.0741799 |
| Michal | 1 | Michael | 0.0741799 |
| Mihael | 1 | Michael | 0.0741799 |
| Angel | 1 | Angela | 0.0871291 |
| Dight | 1 | Dwight | 0.0871291 |
| DeAngelo | 79 | Deangelo | 0.1000000 |
| Meridith | 2 | Meredith | 0.1000000 |
| DeAgnelo | 1 | Deangelo | 0.1000000 |
sum(corr_char$n)
## [1] 163
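Using this approach, I can see that the first 28 fuzzy matches (Distance <= 0.101) identify misspelled character names. Note that the cosine method returns a distance of 0 (a perfect match) for transposed letters, because it compares how often each letter occurs rather than the order of the letters. A correctly spelled name therefore cannot be told apart from a transposed one by distance alone, so I also had to filter on the number of lines (n < 100) to drop the base names themselves.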
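To see why transposed letters score 0, here is a small base R sketch of cosine distance on single-letter counts. The helper `cosine_dist()` is written for illustration and is not part of fuzzyjoin. Treating each name as a vector of single-letter counts is an assumption about the settings behind the Distance column, made because it reproduces the values in the table above (0 for Micheal, 0.0474207 for Phylis, 0.1 for DeAngelo).

```r
# Cosine distance between the letter-count profiles of two strings.
# Assumption: single letters are the unit of comparison (q-grams with q = 1).
cosine_dist <- function(a, b) {
  chars_a <- strsplit(a, "")[[1]]
  chars_b <- strsplit(b, "")[[1]]
  alphabet <- union(chars_a, chars_b)
  # Count each letter of the shared alphabet, including zero counts
  va <- as.numeric(table(factor(chars_a, levels = alphabet)))
  vb <- as.numeric(table(factor(chars_b, levels = alphabet)))
  1 - sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

d_transposed <- cosine_dist("Micheal", "Michael")    # same letters, different order
d_dropped    <- cosine_dist("Phylis", "Phyllis")     # one letter missing
d_case       <- cosine_dist("DeAngelo", "Deangelo")  # "A" and "a" are different letters

print(round(c(d_transposed, d_dropped, d_case), 7))
```

Because the comparison is case-sensitive, a capitalization difference such as DeAngelo costs more distance than a dropped letter.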
The fuzzyjoin approach rapidly diminishes the work required; it let me correct the speaker name on 163 lines of dialogue. However, 1) I still needed to manually inspect the matches to determine which were errors and which were legitimate entries, and 2) I didn’t find all the misspellings.
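office_tidy_chars_fj <- office_raw %>%
  left_join(select(corr_char, speaker, base_char), by = "speaker") %>%
  # coalesce() is vectorized, so no grouping is needed:
  # use the corrected name where one exists, otherwise keep the original
  mutate(speaker = coalesce(base_char, speaker)) %>%
  select(-base_char)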
diag_lines_fj <- office_tidy_chars_fj %>%
group_by(speaker) %>%
summarize(n = n()) %>%
arrange(desc(n))
diag_lines_fj %>%
stringdist_left_join(base_chars, by = c("speaker" = "base_char"),
method = "cosine",
max_dist = .5, distance_col = "Distance") %>%
group_by(speaker) %>%
filter(Distance > 0) %>%
arrange(Distance, desc(n)) %>%
head(10)
| speaker | n | base_char | Distance |
|---|---|---|---|
| Randy | 2 | Ryan | 0.1055728 |
| sAndy | 1 | Andy | 0.1055728 |
| Phyliss | 1 | Phyllis | 0.1111111 |
| Chelsea | 1 | Charles | 0.1180829 |
| Meredith’s Vet | 3 | Meredith | 0.1235402 |
| Deangelo/Michael | 2 | Deangelo | 0.1317569 |
| Joan | 5 | Jan | 0.1339746 |
| Rory | 2 | Roy | 0.1339746 |
| abe | 1 | Gabe | 0.1339746 |
| Molly | 3 | Holly | 0.1428571 |
For a dataset that contains sitcom dialogue, some degree of error is tolerable. For one that has critically important information, for example for a scientific publication, all of the errors must be tracked down. This dataset illustrates just how difficult that task can be.
In the end, although the fuzzyjoin approach was a time-saver, I returned to the brute force approach described earlier, using nearly 70 str_replace steps. That created the office_tidy_chars dataframe. I needed to run more checks to look for anomalous entries, e.g. dialogue misclassified as character names.
office_tidy_chars %>%
group_by(speaker) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
filter(n == 1 & str_length(speaker) > 30)
| speaker | n |
|---|---|
| [Clark and Pete are shown on screen] Video Andy: Hey, I’m Pete, puberty is such a drag, man. And I’m Clark! I like to eat toilet paper. [Clark and Pete wave at camera] We fail! [Video shows memorial of Jerry | 1 |
| Andy, Creed, Kevin, Kelly, Darryl | 1 |
| Female church member [to Michael] | 1 |
| Group: Dunder Mifflin! Andy: Andy Bernard presents: Summer Softball Epic Fails! [Kevin swings bat on screen, fart noise follows] Fail. [repeats] Fail | 1 |
| Meredith, Creed, Oscar and Matt | 1 |
| Oscar’s voice from the computer | 1 |
| Phyllis, Meredith, Michael, Kevin | 1 |
For the three problematic cases, I found it easiest to change them manually.
# For these three rows the dialogue landed in the speaker column:
# move it into line_text, then credit the line to Andy.
# Columns are referenced by name because, after the select() above,
# positions 5 and 6 are deleted and speaker, not line_text and speaker.
fix_rows <- c(53564, 53565, 53573)
office_tidy_chars$line_text[fix_rows] <- office_tidy_chars$speaker[fix_rows]
office_tidy_chars$speaker[fix_rows] <- "Andy"
The use of the tilde to replace the bad characters in the data allowed me to repair some, though not all, of the altered words. The mutate steps reduced the number of bad lines from 2559 in office_tidy_chars to 763 in the new dataframe, office_tidy_dial.
office_tidy_dial <- office_tidy_chars %>%
mutate(line_text = str_replace_all(line_text, "~~~s", "'s")) %>%
mutate(line_text = str_replace_all(line_text, "~~~d", "'d")) %>%
mutate(line_text = str_replace_all(line_text, "~~~t", "'t")) %>%
mutate(line_text = str_replace_all(line_text, "~~~m", "'m")) %>%
mutate(line_text = str_replace_all(line_text, "~~~ll", "'ll")) %>%
mutate(line_text = str_replace_all(line_text, "~~~re", "'re")) %>%
mutate(line_text = str_replace_all(line_text, "~~~ve", "'ve")) %>%
mutate(line_text = str_replace_all(line_text, "~~~em", "'em")) %>%
mutate(line_text = str_replace_all(line_text, "~~~mon", "'mon"))
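The nine repairs above share one shape: three tildes followed by a contraction ending. As a sketch only, they can be folded into a single regular expression with a backreference. The version below is slightly stricter than the code above, since it adds a word boundary after the ending so that a placeholder in front of an ordinary word is left alone. The sample strings and the names `x`, `repaired`, and `still_bad` are made up for illustration.

```r
# Made-up lines using the tilde placeholder described above
x <- c("I don~~~t get it.",
       "I think I~~~m busy",
       "C~~~mon, we~~~re late",
       "~~~Higher than numbers go.~~~")

# Three tildes + a known contraction ending + word boundary -> apostrophe + ending
contraction <- "~~~(s|d|t|m|ll|re|ve|em|mon)\\b"
repaired <- gsub(contraction, "'\\1", x, perl = TRUE)

# Lines that still contain a placeholder (here, the quotation marks)
still_bad <- grepl("~", repaired, fixed = TRUE)

print(repaired)
print(sum(still_bad))
```

Placeholders that stood in for quotation marks, as in the last sample line, are deliberately left untouched, because the tilde alone does not say whether the original was an opening or a closing quote.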
One final, critical practice to make my life easier: once I have the characters and dialogue reasonably cleaned up, save it as a file so I don’t have to recreate it!
write_csv(office_tidy_dial, "office_tidy_dialogue.csv")
Now we can finally have some fun with the data. How many times does Pam Beesly answer the phone by saying “Dunder Mifflin, this is Pam”?
office_tidy_dial %>%
filter(str_detect(line_text, "his is Pam") & str_detect(line_text, "Dunder"))
| id | season | episode | scene | deleted | speaker | line_text |
|---|---|---|---|---|---|---|
| 84 | 1 | 1 | 18 | FALSE | Pam | Dunder Mifflin. This is Pam. |
| 623 | 1 | 3 | 13 | FALSE | Pam | Dunder Mifflin, this is Pam. |
| 1144 | 1 | 4 | 50 | TRUE | Pam | [telephone rings] Dunder Mifflin, this is Pam. Hold please. |
| 1145 | 1 | 4 | 50 | TRUE | Michael | All righty then, well I see you’re going for the whole bored supermodel thing. ‘Dunder Mifflin, this is Pam. May I help you?’ [takes a drag from an imaginary cigarette] Smoke, smoke, smoke, smoke. |
| 1170 | 1 | 4 | 50 | TRUE | Pam | [smiling] Dunder Mifflin, this is Pam. One moment I’ll transfer you. |
| 3122 | 2 | 4 | 1 | FALSE | Pam | Dunder Mifflin, this is Pam. Sure, can I ask who’s calling? Just a second. |
| 3635 | 2 | 5 | 17 | FALSE | Pam | [on phone] Dunder-Mifflin. This is Pam. [listens] Uh, yeah. [snaps her fingers in the air, getting Jim’s attention] Just one second. I will, uh, transfer you to our manager, Michael Scott. |
| 4290 | 2 | 7 | 15 | FALSE | Pam | Dunder-Mifflin, this is Pam. |
| 4429 | 2 | 7 | 41 | FALSE | Pam | Dunder-Mifflin, this is Pam. |
| 5993 | 2 | 12 | 2 | FALSE | Pam | Dunder Mifflin, this is Pam. |
| 6332 | 2 | 12 | 35 | FALSE | Pam | Dunder Mifflin, this is Pam. |
| 6969 | 2 | 14 | 53 | FALSE | Pam | [voicemail message for Jim] I’ll transfer you. Dunder Mifflin, this is Pam. Hold, please. Dunder Mifflin, this is … okay, sorry. Michael was standing at my desk, and I needed to be busy or who knows what would’ve happened, so thank you. |
| 7333 | 2 | 15 | 52 | FALSE | Pam | Dunder Mifflin. This is Pam. Uh… hold, please. |
| 8752 | 2 | 20 | 56 | TRUE | Pam | [telephone ringing] Dunder Mifflin. This is Pam. Um, hold, please. [to Jim] There’s a Brenda on the phone for you. [to Brenda] Just one second, I’ll transfer. |
| 9037 | 2 | 21 | 60 | TRUE | Pam | [telephone ringing] Dunder Mifflin. This is Pam. Hold, please. Dwight, it’s the Sheriff. He said that it’s really important. It’s regarding your desk. I’ll transfer. |
| 9824 | 3 | 2 | 25 | FALSE | Pam | [answering phone] Dunder-Mifflin, this is Pam. He’s not in the office. Can I take a message? I will. You too. [hangs up] Sorry. What’s up? |
| 10275 | 3 | 3 | 62 | FALSE | Pam | Dunder Mifflin, this is Pam. … Uh, sure, I’ll get him for you. [to Michael] It’s Jan for you. |
| 10773 | 3 | 5 | 39 | FALSE | Pam | Dunder Mifflin, this is Pam. Oh, hi Jan. He’s, uh, on a sales call. No message? Bye, Jan. |
| 11260 | 3 | 7 | 18 | FALSE | Pam | It’s a blessing in disguise. Actually, not even in disguise. Sometimes at home, I answer the phone, ‘Dunder-Mifflin, this is Pam.’ So, maybe that’ll stop now. |
| 12915 | 3 | 11 | 21 | FALSE | Pam | [on phone] Dunder Mifflin, this is Pam. Just a second. Michael, it’s Jan on the phone for you. |
| 13219 | 3 | 12 | 23 | TRUE | Pam | Dunder Mifflin, this is Pam. [excited] This is Pam. I did? |
| 16925 | 3 | 23 | 77 | FALSE | Pam | [phone rings, Pam answers] Dunder Mifflin, this is Pam. Just one moment, I’ll transfer you. |
| 17218 | 4 | 1 | 35 | FALSE | Pam | Michael Scott’s Dunder-Mifflin, Scranton, Meredith Palmer memorial, celebrity rabies awareness, fun run race for the cure, this is Pam. |
| 18270 | 4 | 3 | 22 | FALSE | Pam | Dunder Mifflin. This is Pam. |
| 19471 | 4 | 5 | 30 | FALSE | Pam | Dunder Mifflin, this is Pam. |
| 21591 | 4 | 12 | 15 | FALSE | Pam | No. [Kevin leaves; Pam takes off her glasses; phone rings] Dunder Mifflin, this is Pam. Okay, go ahead. [puts a notepad close to her face and writes message] |
| 26108 | 5 | 11 | 1 | FALSE | Pam | [answering the phone] Dunder Mifflin, this is Pam. I’m sorry, he’s not in yet. Would you like his voicemail? |
| 26959 | 5 | 13 | 37 | FALSE | Pam | Dunder Mifflin this is Pam. Oh, hey Mom. No, what did Dad say? |
| 27038 | 5 | 13 | 58 | FALSE | Pam | Dunder Mifflin this is Pam. Uh, I’m sorry, Michael’s not here right now can I take a message? Great. I will. Thanks. |
| 27978 | 5 | 17 | 12 | FALSE | Pam | Dunder Mifflin, this is Pam. Oh hi ,David. [Michael shakes his head to Pam] No, I’m sorry he’s not back from the Civil Rights rally. I’ll have him call you the minute he gets back from the Lincoln Memorial. |
| 27998 | 5 | 17 | 14 | FALSE | Pam | [on phone] Dunder Mifflin, this is Pam. Oh hi, David. He’s having a colonoscopy. Alright, I’ll find out if he’s out yet. |
| 29276 | 5 | 21 | 38 | FALSE | Pam | Dunder Miff…Michael Scott Paper Company, this is Pam. Oh, hi Russell from the pancake luncheon, how are you? Well we’d like to do business with you too! How can we make that happen? |
| 42681 | 7 | 14 | 13 | FALSE | Pam | [on phone] Dunder Mifflin, this is Pam. |
| 54989 | 9 | 8 | 30 | FALSE | Pam | [into phone] Hello, this is Pam Halpert. I’m calling from Dunder-Mifflin. Yes, your paper provider. And I just called to say… your mama is so fat, when she wears red, people yell, ~ |
| 59858 | 9 | 23 | 94 | FALSE | Pam | [answering the phone] Dunder Mifflin, This is Pam. Oh, I’m sorry. Jim Halpert doesn’t work here anymore. |
I used the broader search string “his is Pam” to catch some slightly different phrasings as well as varying punctuation. However, that caught a few other uses of “this is Pam” that were not related to her answering the phone.
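A tighter filter is possible if the goal is to count only the times Pam herself answers with the greeting. The sketch below is an assumption-laden alternative, not the search used above: it requires the speaker to be Pam and the greeting to open the line (after an optional stage direction), so it would miss greetings that appear mid-line. The data frame `dial` and its rows are simplified, made-up stand-ins for the real dialogue.

```r
# Simplified, hypothetical rows modeled on the results above
dial <- data.frame(
  speaker = c("Pam", "Pam", "Michael", "Pam", "Pam"),
  line_text = c(
    "Dunder Mifflin, this is Pam.",
    "[on phone] Dunder-Mifflin. This is Pam. Hold, please.",
    "Dunder Mifflin, this is Pam. May I help you?",
    "Hello, this is Pam Halpert. I'm calling from Dunder-Mifflin.",
    "Dunder Mifflin this is Pam. Oh, hey Mom."
  ),
  stringsAsFactors = FALSE
)

# Optional leading [stage direction], then the greeting with flexible punctuation
greeting <- "^(\\[[^]]*\\] ?)?Dunder[- ]Mifflin[.,]? this is Pam\\b"

is_greeting <- dial$speaker == "Pam" &
  grepl(greeting, dial$line_text, ignore.case = TRUE, perl = TRUE)

print(sum(is_greeting))
```

The trade-off is the usual one: the broad search over-counts and needs manual review, while the strict pattern under-counts and needs its misses checked.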
Full list of “That’s what she said” responses and who said the previous line.
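# row positions of every "that's what she said" variant
shesaidvec <- str_which(office_tidy_dial$line_text, "hat she sai|HAT SHE SAI|hat She Sai")
# keep each response together with the line just before it
shesaid <- office_tidy_dial[shesaidvec, ] %>%
  bind_rows(office_tidy_dial[shesaidvec - 1, ]) %>%
  arrange(id)
# preview only; the full set stays in shesaid for export
shesaid %>% head(10)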
# write_csv(shesaid, "thats_what_she_said.csv")
In total, there were 39 instances of “that’s what she said” (thanks to Caitlyn Ralph’s article for helping me locate a couple of variants). So how could I create a reasonably tidy dataframe that would include both the response “that’s what she said” and the line that triggered it? I had two options:
As Sharla Gelfand wrote in one of her recent talks:
library(readxl)  # read_xlsx()
shesaid_full <- read_xlsx("thats_what_she_said.xlsx")
shesaid_full
| id | season | episode | scene | deleted | self | speaker | line_text | next_speaker | answer_text |
|---|---|---|---|---|---|---|---|---|---|
| 2544 | 2 | 2 | 24 | FALSE | FALSE | Jim | No, thanks. I’m good. | Michael | That’s what she said. Pam? |
| 2546 | 2 | 2 | 24 | FALSE | FALSE | Pam | Uh… my mother’s coming. | Michael | That’s what she sai [clears throat] Nope, but… Okay. Well, suit yourself. |
| 2590 | 2 | 2 | 34 | FALSE | FALSE | Michael | And in the future, if I want to say something funny or witty or do an impression, I will no longer, ever, do any of those things. | Jim | Does that include ‘That’s What She Said’? |
| 2593 | 2 | 2 | 34 | FALSE | FALSE | Jim | Wow! That is really hard. You really think you can go all day long? Well, you always left me satisfied and smiling, so… | Michael | THAT’S WHAT SHE SAID! |
| 5324 | 2 | 10 | 2 | FALSE | FALSE | Kevin | [holds up the piece of tree he just cut off with a paper cutter] Well, sort of. Why did you get it so big? | Michael | A, that’s what she said, and B, I wanted it to be impressive. The biggest day of the year deserves the biggest tree of the year. |
| 6321 | 2 | 12 | 33 | FALSE | FALSE | Doctor | Does the skin look red and swollen? | Dwight | That’s what she said. |
| 6352 | 2 | 12 | 38 | TRUE | FALSE | Oscar | [Jim popping Michael’s bubble wrap cast] You should put butter on it. | Michael | Uh, that’s what she said. See, haven’t lost my sense of humor. No, no need, it was a non-stick grill. |
| 7643 | 2 | 17 | 5 | FALSE | FALSE | Dwight | [eating grapes] | Michael | That’s what she said! |
| 8871 | 2 | 21 | 22 | FALSE | FALSE | Angela | You already did me. | Michael | That’s what she said. [Jim mouths these words along with Michael] The thing is, Angela… you are in here an awful lot. You have complained about everybody in the office, except Dwight, which is odd because everyone else has had run ins with Dwight. Toby, by the way, what does ‘redacted’ mean? There is a file full of complaints in here marked ‘redacted’… ? |
| 9623 | 3 | 1 | 48 | FALSE | TRUE | Michael | But you know what? Even if it didn’t, at least we put this matter to bed. | Michael | …that’s what she said. Or he said. |
| 10903 | 3 | 5 | 59 | FALSE | FALSE | Michael | I mean, they’re just dough twisted up with some candy. They taste so good in my mouth. | Stanley | That’s what she said. [Stanley and Michael both laugh] |
| 12593 | 3 | 10 | 49 | FALSE | FALSE | Second Cindy | Thanks! I, I wanna give you something. [She whispers in his ear. Michael starts to laugh] | Michael | Oh. That’s what she said. |
| 13336 | 3 | 12 | 41 | FALSE | FALSE | Michael | I want you to think about your future in this company. I want you to think about it long and hard. | Dwight | That’s what she said. |
| 14301 | 3 | 17 | 9 | FALSE | FALSE | Jan | Let’s just blow this party off. | Michael | That’s what she said. |
| 14373 | 3 | 17 | 22 | FALSE | TRUE | Jan | Why is this so hard? | Jan | That’s what she said. Oh my God. What am I saying? |
| 15385 | 3 | 20 | 11 | FALSE | TRUE | Michael | No, no. I need two men on this. | Michael | That’s what she said. No time! But she did. NO TIME! Guys, get on this. Dwight, I want you to be in charge of the press conference. |
| 16405 | 3 | 22 | 68 | FALSE | FALSE | Michael | No mustard! No mustard! Just… eat it. Eat it, Phyllis. Dip it in the water so it will slide down your gullet more easily. | Everyone | That’s what she said! |
| 17569 | 4 | 2 | 5 | FALSE | TRUE | Michael | Hey. Can you make that straighter? | Michael | That’s what she said. |
| 18959 | 4 | 4 | 44 | FALSE | TRUE | Michael | And the best way to start is to hit start. And up comes the toolbar, | Michael | that’s what she said. What we have to do here is go to Run, and then you look up to PowerPoint. And we are in. We are going to register. You hit register— Updates are ready. I should update. Um, estimated time 12 minutes, so this should take 5 or 10 minutes. |
| 20121 | 4 | 7 | 56 | FALSE | TRUE | Michael | That’s what I said. | Michael | That’s what she said. |
| 20124 | 4 | 7 | 56 | FALSE | FALSE | Michael | I never know. I just say it. I say stuff like that, you know, to lighten the tension. When things sort of get hard. | Jim | That’s what she said. |
| 20269 | 4 | 8 | 23 | FALSE | FALSE | Lester | And you were directly under her the entire time? | Michael | That’s what she said. |
| 20271 | 4 | 8 | 23 | FALSE | FALSE | Lester | Excuse me? | Michael | That’s what she said. |
| 20277 | 4 | 8 | 23 | FALSE | TRUE | Michael | Come again? | Michael | That’s what she said? I don’t know what you’re talking about. |
| 20282 | 4 | 8 | 23 | FALSE | TRUE | Deposition Reporter | [reading off paper] Mr. Schneider: And you were directly under her the entire time? Mr. Scott: | Deposition Reporter | That’s what she said. |
| 20715 | 4 | 9 | 19 | FALSE | FALSE | Jan | AND YOU’RE HARDLY MY FIRST! | Michael | THAT’S WHAT SHE SAID! [Jan gets an evil look on her face and picks up Michael’s dundie and throws it into his plasma screen tv] THAT IS A 200 DOLLAR PLASMA SCREEN TV YOU JUST KILLED! Good luck paying me back on your zero dollars a year salary plus benefits, babe! [Jan goes upstairs crying.] |
| 21480 | 4 | 12 | 2 | FALSE | FALSE | Dwight | And… go. [Michael sticks his face in the cement] Force it in as deep as you can. | Michael | [muffled] That’s what she said. |
| 23149 | 5 | 1 | 111 | FALSE | FALSE | Jim | Yeah, well, if you’re only free till three on Sunday and I can’t get there till one, then it’s gonna be pretty tight. | Michael | That’s what she said. |
| 23910 | 5 | 4 | 30 | FALSE | FALSE | Michael | It squeaks when you bang it, | Michael | that’s what she said. Let’s hear it for me! Right? A bargain at any price! |
| 24196 | 5 | 5 | 25 | FALSE | FALSE | Holly | Michael. Don’t. Don’t. Don’t make it harder than it has to be. | Michael | That’s what she said. |
| 24754 | 5 | 6 | 27 | FALSE | FALSE | Kelly | Dwight, get out of my nook! | Pam | [in New York] That’s what she said! That’s what she said! That’s what she said! |
| 28089 | 5 | 17 | 27 | FALSE | FALSE | David | Alright Dwight. This is huge. | Dwight | That’s what she said! [David laughs] |
| 36370 | 6 | 18 | 9 | FALSE | FALSE | Darryl | You need to get back on top. | Michael | That’s what she said. |
| 40586 | 7 | 8 | 29 | FALSE | FALSE | Gabe | Michael! You are making this harder than it has to be. | Michael | [grimacing] That’s what she said. [leaves] |
| 42259 | 7 | 13 | 1 | FALSE | TRUE | David | No, no. No, comedy is a place where the mind goes to tickle itself. | David | That’s what she said. [laughs]. [hugs Michaels] Ohh. |
| 43114 | 7 | 15 | 30 | FALSE | TRUE | Holly | I’m not saying it won’t be hard. But we can make it work. | Holly | That’s what she said. |
| 44695 | 7 | 21 | 50 | FALSE | TRUE | Michael | [pulls out his mic from his shirt] This is gonna feel so good, getting this thing off my chest. [he hands them the body mic, when he speaks it is inaudible now] | Michael | That’s what she said! [waves goodbye and walks off to his gate, halfway there Pam comes running up to him and they hug for a while. They say their goodbyes to each other, and Michael walks off for good] |
| 54087 | 9 | 5 | 29 | FALSE | FALSE | Clark | Wait! Wait. Hold on. Where’s the band? ’Cause there’s just no way you guys are making this magic with just your mouths. | Creed | Yeah. That’s what she said. |
| 59750 | 9 | 23 | 68 | FALSE | FALSE | Dwight | [turns around] [whispering] Michael. I can’t believe you came. | Michael | That’s what she said. |
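For comparison, here is one way the paired layout above (speaker, line_text, next_speaker, answer_text) could be produced in code by shifting each column up one row. This is a sketch on made-up rows, not the process used to build the spreadsheet. The `self` flag is derived here simply as "same speaker on both lines", which is an assumption about how that column was defined, and the approach cannot split a setup and its punchline when both sit inside a single line of dialogue.

```r
# Made-up dialogue in scene order (ASCII apostrophes for simplicity)
dial <- data.frame(
  id = 1:7,
  speaker = c("Jim", "Michael", "Pam", "Jan", "Michael", "Michael", "Michael"),
  line_text = c("No, thanks. I'm good.",
                "That's what she said. Pam?",
                "Uh... my mother's coming.",
                "Why is this so hard?",
                "THAT'S WHAT SHE SAID!",
                "Come again?",
                "That's what she said?"),
  stringsAsFactors = FALSE
)

# "Lead" each column by one row so every line sits beside the line that follows it
dial$next_speaker <- c(dial$speaker[-1], NA)
dial$answer_text  <- c(dial$line_text[-1], NA)

# Keep rows whose following line is the catchphrase (NA in the last row counts as no match)
hits <- grepl("that's what she said", dial$answer_text, ignore.case = TRUE)
pairs <- dial[hits, ]

# Assumed definition: the speaker answers their own line
pairs$self <- pairs$speaker == pairs$next_speaker

print(pairs[, c("id", "speaker", "next_speaker", "self")])
```

On real data the shift would need to respect episode and scene boundaries so that the last line of one scene is not paired with the first line of the next.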
links <- shesaid_full %>%
  count(speaker, next_speaker)
# networkD3 expects zero-based node indices; derive them from the number of
# names rather than hard-coding 0:20
nodes <- tibble(name = unique(c(shesaid_full$speaker, shesaid_full$next_speaker)),
                index = seq_along(name) - 1)
new_links <- left_join(links, nodes, by = c("speaker" = "name")) %>%
  select(speaker = index, next_speaker, n) %>%
  left_join(nodes, by = c("next_speaker" = "name")) %>%
  select(speaker, next_speaker = index, n)
In trying to find a way to visualize dialogue patterns, I happened upon the networkD3 package written by Christopher Gandrud et al. One visualization offered by the package is a Sankey diagram that shows nodes and links. I decided to give it a try.
library(networkD3)  # sankeyNetwork()
sn <- sankeyNetwork(Links = new_links, Nodes = nodes, Source = "speaker",
                    Target = "next_speaker", Value = "n", NodeID = "name",
                    fontSize = 20, nodeWidth = 30, height = 600, width = 1000)
sn
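The Sankey function identifies nodes by zero-based position, which is why the names had to be turned into indices before plotting. The base R sketch below shows that conversion with `match()` on a made-up three-person example. The `links` values and the `source` and `target` column names are assumptions for illustration, not the data plotted above.

```r
# Made-up link counts: who set up the joke, who delivered it, how often
links <- data.frame(
  speaker      = c("Jim", "Pam", "Michael"),
  next_speaker = c("Michael", "Michael", "Jim"),
  n            = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# One row per distinct name, in order of first appearance
nodes <- data.frame(
  name = unique(c(links$speaker, links$next_speaker)),
  stringsAsFactors = FALSE
)

# match() returns 1-based positions; subtract 1 for zero-based node indices
links$source <- match(links$speaker, nodes$name) - 1
links$target <- match(links$next_speaker, nodes$name) - 1

print(nodes)
print(links)
```

Deriving the indices from the node table itself means the code keeps working if the number of distinct speakers changes.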